An Efficient Data Indexing Approach on Hadoop Using Java Persistence API

Authors

  • Lai Yang
  • Zhongzhi Shi
Abstract

Data indexing is common in data mining when working with high-dimensional, large-scale data sets. Hadoop, a cloud computing project that implements the MapReduce framework in Java, has attracted significant interest in distributed data mining. To resolve the problems of globalization, random write, and duration in Hadoop, a data indexing approach using the Java Persistence API (JPA) is elaborated through the implementation of a KD-tree algorithm on Hadoop. An improved intersection algorithm for distributed data indexing on Hadoop is also proposed; it has O(M + log N) complexity and is suited to cases involving multiple intersections. The data indexing algorithm is compared on an open dataset and a synthetic dataset in a modest cloud environment. The results show that the algorithms are feasible for large-scale data mining.
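
For readers unfamiliar with the KD-tree named in the abstract, the following is a minimal, single-machine sketch of the generic textbook structure: a median-split build followed by an axis-aligned range query. It is not the authors' implementation and involves neither Hadoop nor JPA; the class name `KdTreeSketch` and its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Minimal in-memory KD-tree. All names here are illustrative assumptions. */
public class KdTreeSketch {
    private static final class Node {
        final double[] point;
        final Node left;
        final Node right;

        Node(double[] point, Node left, Node right) {
            this.point = point;
            this.left = left;
            this.right = right;
        }
    }

    private final Node root;
    private final int dims;

    public KdTreeSketch(List<double[]> points, int dims) {
        this.dims = dims;
        this.root = build(new ArrayList<>(points), 0);
    }

    // Split on the median along one axis, cycling through the axes by depth.
    private Node build(List<double[]> pts, int depth) {
        if (pts.isEmpty()) {
            return null;
        }
        int axis = depth % dims;
        pts.sort(Comparator.comparingDouble(p -> p[axis]));
        int mid = pts.size() / 2;
        return new Node(
                pts.get(mid),
                build(new ArrayList<>(pts.subList(0, mid)), depth + 1),
                build(new ArrayList<>(pts.subList(mid + 1, pts.size())), depth + 1));
    }

    /** Returns every stored point p with lo[d] <= p[d] <= hi[d] on all axes. */
    public List<double[]> rangeQuery(double[] lo, double[] hi) {
        List<double[]> out = new ArrayList<>();
        collect(root, 0, lo, hi, out);
        return out;
    }

    private void collect(Node n, int depth, double[] lo, double[] hi, List<double[]> out) {
        if (n == null) {
            return;
        }
        boolean inside = true;
        for (int d = 0; d < dims; d++) {
            if (n.point[d] < lo[d] || n.point[d] > hi[d]) {
                inside = false;
                break;
            }
        }
        if (inside) {
            out.add(n.point);
        }
        int axis = depth % dims;
        // The left subtree holds values <= the split value, the right holds
        // values >= it, so a subtree is skipped when the query box cannot reach it.
        if (lo[axis] <= n.point[axis]) {
            collect(n.left, depth + 1, lo, hi, out);
        }
        if (hi[axis] >= n.point[axis]) {
            collect(n.right, depth + 1, lo, hi, out);
        }
    }
}
```

The range query prunes any subtree that lies entirely on the wrong side of the splitting value, which is what makes the structure useful as an index. How such a tree is partitioned across MapReduce tasks or persisted through JPA is specific to the paper and is not shown here.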

Similar resources

Engineering Projects and Support (Pre and Final Semester)

Open Source Science Clouds Cloud Computing hosting Architecture Architecture of Web-EDA system based on Cloud computing and application for project management of IC design An Efficient Data Mining Framework on Hadoop using Java Persistence API Volunteer Computing and Desktop Cloud: The Cloud@Home Paradigm Model inter comparison study: cloud-radiative forcing and feedback's A Taxonomy and Survey...

TRANSACTIONS ON BIG DATA 1 A Distributed

Java 8 has introduced new capabilities such as lambda expressions and streams which simplify data-parallel computing. However, as a base language for Big Data systems, it still lacks a number of important capabilities such as processing very large datasets and distributing the computation over multiple machines. This paper gives an overview of the Java 8 Streams API and proposes extensions to a...

JPA Criteria Queries over RDF Data

We present the design and implementation of a prototype system for querying RDF data via the Java Persistence API (JPA) criteria query feature. The JPA is a specification for management of (primarily, but not limited to) relational data. It comprises a set of Java interfaces, annotations, and the JPA query language (JPQL) and thus provides a framework for uniform persistence and retrieval of Ja...

Publication date: 2010